Abstract

Let (V, E) be a graph with vertex set V and edge set E. Let $(X, X', Y) \in V \times V \times \{-1, 1\}$
be a random triple, where X, X' are independent uniformly distributed
vertices and Y is a label indicating whether X, X' are "similar" (Y = +1), or not
(Y = -1). Our goal is to estimate the regression function

\[ S_*(u,v) = \mathbb{E}(Y \mid X = u, X' = v), \quad u, v \in V, \]

based on training data consisting of n i.i.d. copies of (X, X', Y). We are interested
in this problem in the case when $S_*$ is a symmetric low rank kernel and, in addition
to this, it is assumed that $S_*$ is "smooth" on the graph. We study estimators based
on a modified least squares method with complexity penalization involving both the
nuclear norm and Sobolev type norms of symmetric kernels on the graph and prove
upper bounds on $L_2$-type errors of such estimators with explicit dependence both
on the rank of $S_*$ and on the degree of its smoothness.



1 Introduction 



Let G = (V, E) be a graph with vertex set V and edge set E, card(V) = m. Let
$A := (a(u,v))_{u,v \in V}$ be the adjacency matrix of G, that is, a(u,v) = 1 if u and v are
connected with an edge and a(u,v) = 0 otherwise. Let $\Delta := D - A$ be the Laplacian
of G, D being the diagonal matrix with the degrees of vertices on the diagonal. Let
$(X, X', Y) \in V \times V \times \{-1, 1\}$ be a random triple with X, X' being independent vertices
sampled at random from the uniform distribution $\Pi$ on V and Y being an "indicator" of a
symmetric binary relationship between X, X' called in what follows a "similarity". More
precisely, Y = +1 indicates that the vertices X, X' are similar and Y = -1 indicates that
they are not. The conditional distribution of Y given X, X' is completely characterized
by the regression function

\[ S_*(u,v) := \mathbb{E}(Y \mid X = u, X' = v), \quad u, v \in V, \]

that is assumed to be a symmetric kernel on $V \times V$ and will be called the similarity kernel.
It is well known that $\mathrm{sign}(S_*(X, X'))$ is the Bayes classifier, that is, the best possible
predictor of Y based on an observation of X, X' in the sense that it minimizes the
generalization error $\mathbb{P}\{Y \neq g(X,X')\}$ over all possible predictors $g : V \times V \mapsto \{-1, 1\}$.
Our goal is to estimate $S_*$ based on the training data $(X_1, X_1', Y_1), \dots, (X_n, X_n', Y_n)$
consisting of n i.i.d. copies of (X, X', Y). We are especially interested in the class of
problems such that, on the one hand, $S_*$ is a matrix (kernel) of relatively small rank
and, on the other hand, $S_*$ possesses a certain degree of smoothness on the graph.
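As a minimal sketch of the objects just introduced, the adjacency matrix, the degree matrix and the Laplacian $\Delta = D - A$ can be assembled as follows. The function name `graph_laplacian`, the 0-based vertex labels and the path graph are illustrative assumptions, not part of the paper.

```python
import numpy as np

def graph_laplacian(m, edges):
    """Return (A, D, Delta) for an undirected graph on vertices 0..m-1 (hypothetical helper)."""
    A = np.zeros((m, m))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0      # a(u, v) = 1 iff u and v are connected by an edge
    D = np.diag(A.sum(axis=1))       # vertex degrees on the diagonal
    return A, D, D - A               # Delta := D - A

# Hypothetical example: a path graph on 4 vertices.
A, D, Delta = graph_laplacian(4, [(0, 1), (1, 2), (2, 3)])
```

The Laplacian built this way is symmetric and nonnegatively definite, with every row summing to zero, which is why it is a natural choice for measuring smoothness on the graph.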

Throughout the paper, $\mathcal{S}_V$ denotes the linear space of symmetric kernels $S : V \times V \mapsto \mathbb{R}$,
$S(u,v) = S(v,u)$, $u, v \in V$, that can be also viewed as real-valued symmetric $m \times m$
matrices. For $S \in \mathcal{S}_V$, let rank(S) denote the rank of S and tr(S) denote the trace
of S. The spectral representation of S has the form $S = \sum_{j=1}^{r} \sigma_j (\psi_j \otimes \psi_j)$, where r =
rank(S), $\sigma_1 \ge \dots \ge \sigma_r$ are the non-zero eigenvalues of S (repeated with their multiplicities)
and $\psi_1, \dots, \psi_r$ are the corresponding orthonormal eigenfunctions (there is a multiple
choice of $\psi_j$s in the case of repeated eigenvalues). We also use the notation
$\mathrm{sign}(S) := \sum_{j=1}^{r} \mathrm{sign}(\sigma_j)(\psi_j \otimes \psi_j)$ and we define the support of S, denoted by supp(S), as the linear
span of $\{\psi_1, \dots, \psi_r\}$ in $\mathbb{R}^V$.

For $1 \le p < \infty$, the Schatten p-norm of $S \in \mathcal{S}_V$ is defined as

\[ \|S\|_p := \bigl(\mathrm{tr}(|S|^p)\bigr)^{1/p} = \Bigl(\sum_{j=1}^{r} |\sigma_j|^p\Bigr)^{1/p}, \]

where $|S| := \sqrt{S^2}$. For p = 1, $\|\cdot\|_1$ is called the nuclear norm, while, for p = 2, $\|\cdot\|_2$ is the
Hilbert-Schmidt or Frobenius norm, that is, the norm induced by the Hilbert-Schmidt
inner product which will be denoted by $\langle\cdot,\cdot\rangle$. The operator or spectral norm is defined
as $\|S\| := \max_j |\sigma_j|$.
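For a symmetric matrix these norms depend only on the eigenvalues, so they can be computed directly from a symmetric eigendecomposition. The sketch below is illustrative; the helper names `schatten_norm` and `operator_norm` and the 2 x 2 example are assumptions made here.

```python
import numpy as np

def schatten_norm(S, p):
    """Schatten p-norm of a symmetric matrix S, computed from its eigenvalues."""
    sigma = np.linalg.eigvalsh(S)                 # zero eigenvalues contribute nothing
    return float((np.abs(sigma) ** p).sum() ** (1.0 / p))

def operator_norm(S):
    """Spectral norm: the largest absolute eigenvalue."""
    return float(np.abs(np.linalg.eigvalsh(S)).max())

S = np.array([[2.0, 1.0], [1.0, 2.0]])            # eigenvalues 1 and 3
nuclear = schatten_norm(S, 1)                     # 1 + 3
frobenius = schatten_norm(S, 2)                   # sqrt(1 + 9)
spectral = operator_norm(S)                       # 3
```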

Let us also denote by $\Pi^2 := \Pi \times \Pi$ the distribution of the random couple (X, X') in
$V \times V$ and let $\|S\|_{L_2(\Pi^2)}$ be the $L_2(\Pi^2)$-norm of kernel S:

\[ \|S\|^2_{L_2(\Pi^2)} = \int_{V \times V} |S(u,v)|^2 \, \Pi^2(du, dv) = \mathbb{E}|S(X,X')|^2. \]

The corresponding inner product is denoted by $\langle\cdot,\cdot\rangle_{L_2(\Pi^2)}$. Clearly, under the assumption
that the distribution $\Pi$ is uniform in V, we have $\|S\|^2_{L_2(\Pi^2)} = m^{-2}\|S\|_2^2$ and
$\langle S_1, S_2\rangle_{L_2(\Pi^2)} = m^{-2}\langle S_1, S_2\rangle$.

The smoothness of a symmetric kernel $S : V \times V \mapsto \mathbb{R}$ can be characterized in terms
of Sobolev type norms $\|\Delta^{p/2}S\|_2^2$ for some p > 0. Note that if S is a kernel of rank r with
spectral representation $S = \sum_{k=1}^{r} \mu_k(\psi_k \otimes \psi_k)$, then$^{1}$

\[ \|\Delta^{p/2}S\|_2^2 = \mathrm{tr}(\Delta^{p/2}S^2\Delta^{p/2}) = \mathrm{tr}(\Delta^{p}S^2)
 = \sum_{k=1}^{r} \mu_k^2 \langle \Delta^{p}\psi_k, \psi_k\rangle = \sum_{k=1}^{r} \mu_k^2 \|\Delta^{p/2}\psi_k\|^2, \]

so, essentially, the smoothness of the kernel S depends on the smoothness of its
eigenfunctions $\psi_k$ on the graph. In particular, for p = 1, we have

\[ \|\Delta^{1/2}S\|_2^2 = \sum_{k=1}^{r} \mu_k^2 \sum_{u \sim v} |\psi_k(u) - \psi_k(v)|^2, \]

where the inner sum is over the couples of vertices connected with an edge.
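The identity for p = 1 is easy to check numerically. The following sketch uses a small hypothetical graph and a random rank-2 kernel (both chosen here only for illustration) and compares $\mathrm{tr}(\Delta S^2)$ with the edge-sum expression.

```python
import numpy as np

m = 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]   # hypothetical graph
A = np.zeros((m, m))
for u, v in edges:
    A[u, v] = A[v, u] = 1.0
Delta = np.diag(A.sum(axis=1)) - A                  # graph Laplacian D - A

rng = np.random.default_rng(0)
B = rng.standard_normal((m, 2))
S = B @ B.T                                         # symmetric kernel of rank 2
mu, psi = np.linalg.eigh(S)                         # S = sum_k mu_k psi_k psi_k^T

lhs = np.trace(Delta @ S @ S)                       # tr(Delta S^2) = ||Delta^{1/2} S||_2^2
rhs = sum(mu[k] ** 2 * sum((psi[u, k] - psi[v, k]) ** 2 for u, v in edges)
          for k in range(m))                        # zero eigenvalues contribute nothing
```

Each edge is counted once in `edges`, matching the convention that the sum runs over couples of connected vertices.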

Given a kernel S, let $L_n(S)$ denote the following penalized empirical risk:

\[ L_n(S) := \|S\|^2_{L_2(\Pi^2)} - \frac{2}{n}\sum_{j=1}^{n} Y_j S(X_j, X_j') + \varepsilon\|S\|_1 + \bar\varepsilon\,\|W^{1/2}S\|^2_{L_2(\Pi^2)}
 = \|S\|^2_{L_2(\Pi^2)} - \frac{2}{n}\sum_{j=1}^{n} Y_j S(X_j, X_j') + \varepsilon\|S\|_1 + \varepsilon_1\|W^{1/2}S\|_2^2, \tag{1.1} \]

where $W = d\Delta^{p}$ for some constants d > 0 and p > 0, $\varepsilon, \bar\varepsilon > 0$ are regularization
parameters and $\varepsilon_1 = \bar\varepsilon/m^2$. We will study the following estimation method:

\[ \hat S := \mathrm{argmin}_{S \in B} L_n(S), \tag{1.2} \]

where B is a closed convex subset of the linear space $\mathcal{S}_V$ of all symmetric kernels. Note
that there are two complexity penalties involved in the definition of penalized empirical
risk (1.1). The first penalty is based on the nuclear norm and it is used to "promote"
low rank solutions. The second penalty is based on a "Sobolev type norm" $\|W^{1/2}S\|^2_{L_2(\Pi^2)}$.



1 Below $\|\cdot\|$ denotes the Euclidean norm in $\mathbb{R}^V$; there is a slight abuse of notation here since we also
denote the operator norm by $\|\cdot\|$.

It is used to "promote" the smoothness of the solution on the graph. In principle, W in
the definition of $L_n(S)$ could be an arbitrary symmetric nonnegatively definite matrix.
Therefore, alternative interpretations of the problem under consideration are possible
(such as, for instance, learning similarities on weighted graphs).
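To make the penalized empirical risk concrete, here is a minimal sketch that evaluates it for a given symmetric matrix under the uniform design, using $\|S\|^2_{L_2(\Pi^2)} = m^{-2}\|S\|_2^2$ and $\|W^{1/2}S\|_2^2 = \mathrm{tr}(SWS)$. The function name `penalized_risk`, its argument names and the toy data are assumptions made for illustration; the sketch only evaluates the objective and does not perform the minimization over the convex set.

```python
import numpy as np

def penalized_risk(S, W, X, Xp, Y, eps, eps_bar):
    """Evaluate the penalized empirical risk for a symmetric m x m matrix S (illustrative sketch)."""
    X, Xp, Y = np.asarray(X), np.asarray(Xp), np.asarray(Y, dtype=float)
    m, n = S.shape[0], len(Y)
    l2_sq = (S ** 2).sum() / m ** 2                    # ||S||^2_{L2(Pi^2)} under uniform Pi
    fit = 2.0 / n * np.sum(Y * S[X, Xp])               # (2/n) sum_j Y_j S(X_j, X'_j)
    nuclear = np.abs(np.linalg.eigvalsh(S)).sum()      # nuclear norm ||S||_1
    sobolev = np.trace(S @ W @ S) / m ** 2             # ||W^{1/2} S||^2_{L2(Pi^2)}
    return l2_sq - fit + eps * nuclear + eps_bar * sobolev

# Toy data: two vertices joined by an edge, W equal to the Laplacian, S the identity kernel.
W = np.array([[1.0, -1.0], [-1.0, 1.0]])
risk = penalized_risk(np.eye(2), W, X=[0, 1], Xp=[0, 1], Y=[1, 1], eps=0.1, eps_bar=0.2)
```

In the toy call the four terms are 0.5, 2, 0.1 * 2 and 0.2 * 0.5, so the value is -1.2.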

We will derive an upper bound on the error $\|\hat S - S_*\|^2_{L_2(\Pi^2)} = m^{-2}\|\hat S - S_*\|_2^2$ of the
estimator $\hat S$ in terms of spectral characteristics of the target similarity matrix $S_*$ and
matrix W. Before stating the main results, let us recall recent advances on low rank
matrix completion problems in which the approach based on nuclear norm penalization
has been crucial.

Suppose first that a symmetric kernel $S_* \in \mathcal{S}_V$ is observed at random points
$(X_j, X_j')$, $j = 1, \dots, n$, where $X_j, X_j'$, $j = 1, \dots, n$ are independent and sampled from
the uniform distribution $\Pi$ in V. In this case, V is an arbitrary finite set of cardinality
m and the set of edges E is not specified. It is assumed that $Y_j = S_*(X_j, X_j')$, so there
are no errors in the observations. In such a noiseless case, the following method is used to
recover $S_*$ based on the observations $(X_1, X_1', Y_1), \dots, (X_n, X_n', Y_n)$:

\[ \hat S := \mathrm{argmin}\bigl\{\|S\|_1 : S \in \mathcal{S}_V,\ S(X_j, X_j') = Y_j,\ j = 1, \dots, n\bigr\}. \]

Such methods of recovery of low rank target matrices $S_*$ have been extensively studied
in the recent literature (see Candes and Recht (2009), Recht, Fazel and Parrilo (2010),
Candes and Tao (2010), Gross (2011) and references therein). It is easy to see that there
are low rank matrices $S_*$ that cannot be recovered based on a random sample of n entries
unless n is very large (comparable with the total number of entries of the matrix). Indeed,
consider $S_*$ such that, for given $u, v \in V$, $S_*(u,v) = S_*(v,u) = 1$ and $S_*(u',v') = 0$
otherwise. For this rank 2 matrix, the probability that the two "informative" entries are
not present in the sample is $(1 - 2/m^2)^n$, which is close to 1 if $n = o(m^2)$. Such sparse
low rank matrices should be excluded to make it possible to recover the target low rank
matrix based on relatively small samples of entries. This is done by introducing so called
low coherence assumptions. Let $\{e_v : v \in V\}$ be the canonical orthonormal basis of $\mathbb{R}^V$
equipped with the standard Euclidean inner product. Given a linear subspace $L \subset \mathbb{R}^V$,
denote by $L^{\perp}$ the orthogonal complement of L and by $P_L$ the projector onto the subspace
L. Let $L := \mathrm{supp}(S_*)$, $r = \mathrm{rank}(S_*)$ and suppose there exists a constant $\nu \ge 1$ (coherence
coefficient) such that

\[ \|P_L e_v\|^2 \le \frac{\nu r}{m},\ v \in V \quad \text{and} \quad |\langle \mathrm{sign}(S_*)e_u, e_v\rangle|^2 \le \frac{\nu r}{m^2},\ u, v \in V. \tag{1.3} \]
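The two quantities in this discussion are simple to compute. The sketch below evaluates the probability that both informative entries of the sparse rank 2 example are missed, and the smallest $\nu \ge 1$ for which conditions (1.3) hold for a given symmetric matrix. The helper names `miss_probability` and `coherence`, the tolerance `tol` and the two test matrices are illustrative assumptions.

```python
import numpy as np

def miss_probability(m, n):
    """P(neither (u, v) nor (v, u) is sampled in n uniform draws from V x V)."""
    return (1.0 - 2.0 / m ** 2) ** n

def coherence(S, tol=1e-10):
    """Smallest nu >= 1 satisfying both inequalities in (1.3) for symmetric S."""
    m = S.shape[0]
    sigma, psi = np.linalg.eigh(S)
    keep = np.abs(sigma) > tol                    # non-zero eigenvalues span supp(S)
    U = psi[:, keep]
    r = U.shape[1]
    P_L = U @ U.T                                 # projector onto L = supp(S)
    sign_S = (U * np.sign(sigma[keep])) @ U.T     # sign(S)
    nu1 = np.max(np.diag(P_L)) * m / r            # ||P_L e_v||^2 = (P_L)_{vv}
    nu2 = np.max(sign_S ** 2) * m ** 2 / r
    return float(max(nu1, nu2, 1.0))

m = 10
spiky = np.zeros((m, m))
spiky[0, 1] = spiky[1, 0] = 1.0                   # the sparse rank 2 example
flat = np.ones((m, m))                            # a maximally "spread out" rank 1 kernel
nu_spiky, nu_flat = coherence(spiky), coherence(flat)
```

For the sparse example the coherence coefficient equals $m^2/r$, the worst possible order, whereas for the all-ones kernel it equals 1.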

The following result is due to Candes and Tao (2010) and Gross (2011) (we state
here a version of Gross that is an improvement of an earlier result of Candes and Tao
with significant simplification of the proof).

Theorem 1 Suppose conditions (1.3) hold for some $\nu \ge 1$. Then, there exists a constant
C > 0 such that, for all $n \ge C\nu r m \log^2 m$, $\hat S = S_*$ with probability at least $1 - m^{-2}$.

Thus, if, for the target matrix $S_*$, the coherence coefficient $\nu \ge 1$ is relatively small,
the nuclear norm minimization algorithm (1.2) does provide the exact recovery of $S_*$ as
soon as the number of observed entries n is of the order mr (up to a log factor).

In the case when $Y_j$ are noisy observations of $S_*(X_j, X_j')$ with

\[ \mathbb{E}(Y_j \mid X_j = u, X_j' = v) = S_*(u,v), \]

one can use the following estimation method based on penalized empirical risk
minimization with quadratic loss and with nuclear norm penalty:

\[ \hat S := \mathrm{argmin}_{S \in \mathcal{S}_V} \Bigl[ n^{-1}\sum_{j=1}^{n} \bigl(Y_j - S(X_j, X_j')\bigr)^2 + \varepsilon\|S\|_1 \Bigr]. \tag{1.4} \]



This method has also been extensively studied in recent years, in particular, by
Candes and Plan (2011), Rohde and Tsybakov (2011), Negahban and Wainwright (2010),
Koltchinskii, Lounici and Tsybakov (2011), Koltchinskii (2011b). It was also pointed
out by Koltchinskii, Lounici and Tsybakov (2011) that in the case of known design
distribution $\Pi$ (which is the case in our paper) one can use instead of (1.4) the following
modified method$^{2}$:

\[ \hat S := \mathrm{argmin}_{S \in \mathcal{S}_V} \Bigl[ \|S\|^2_{L_2(\Pi^2)} - \frac{2}{n}\sum_{j=1}^{n} Y_j S(X_j, X_j') + \varepsilon\|S\|_1 \Bigr]. \tag{1.5} \]

Clearly, (1.5) is equivalent to method (1.2) defined above for $\bar\varepsilon = 0$.

When the observations satisfy $|Y_j| \le 1$, $j = 1, \dots, n$ (for instance, when $Y_j \in \{-1, 1\}$, which
is the case studied in the paper), the next result follows from Theorem 4 in Koltchinskii,
Lounici and Tsybakov (2011).

Theorem 2 For t > 0, suppose that

\[ \varepsilon \ge 4\Bigl( \sqrt{\frac{t + \log(2m)}{nm}} \vee \frac{2(t + \log(2m))}{n} \Bigr). \]

Then with probability at least $1 - e^{-t}$,

\[ \|\hat S - S_*\|^2_{L_2(\Pi^2)} \le \Bigl(\frac{1 + \sqrt{2}}{2}\Bigr)^2 m^2\varepsilon^2\, \mathrm{rank}(S_*). \]

2 Note that, if the norm $\|S\|_{L_2(\Pi^2)}$ in definition (1.5) is replaced by the $L_2(\Pi_n)$-norm, where $\Pi_n$
is the empirical distribution based on $(X_1, X_1'), \dots, (X_n, X_n')$, then the resulting estimator coincides with
(1.4).

Our main goal is to show that this bound can be improved in the case when the
target kernel $S_*$, in addition to having relatively small rank, is also smooth on the graph
and when the estimation method (1.2) is used with a proper choice of regularization
parameters $\varepsilon, \bar\varepsilon$.

2 Main Results 

Suppose that W has the following spectral representation: $W = \sum_{k=1}^{m} \lambda_k(\phi_k \otimes \phi_k)$,
where $0 \le \lambda_1 \le \dots \le \lambda_m$ are the eigenvalues of W (repeated with their multiplicities)
and $\phi_1, \dots, \phi_m$ are the corresponding orthonormal eigenfunctions (of course, there is a
multiple choice of $\phi_k$ in the case of repeated eigenvalues). Let $k_0$ be the smallest k such
that $\lambda_k > 0$. We will assume that, for some (arbitrarily large) $\zeta \ge 1$, $\lambda_m \le m^{\zeta}$ and
$\lambda_{k_0} \ge m^{-\zeta}$. In addition, it is assumed that $s \mapsto s/\lambda_s$ is a nonincreasing sequence, that, for
all $k = k_0, \dots, m-1$, $\lambda_{k+1} \le c\lambda_k$, and that, for all $s \ge k_0$,

\[ \sum_{k=s}^{m} \frac{1}{\lambda_k} \le c\,\frac{s}{\lambda_s} \tag{2.1} \]

with a constant c > 0.

Suppose now that the spectral representation of $S_*$ is $S_* = \sum_{k=1}^{r} \mu_k(\psi_k \otimes \psi_k)$, where
$r = \mathrm{rank}(S_*) \ge 1$, $\mu_k$ are non-zero eigenvalues of $S_*$ (possibly repeated) and $\psi_k$ are the
corresponding orthonormal eigenfunctions. Denote $L := \mathrm{supp}(S_*)$. Let $\varphi$ be an arbitrary
nondecreasing function such that $k \mapsto \varphi(k)/k$ is nonincreasing and

\[ \sum_{j=1}^{k} \|P_L\phi_j\|^2 \le \varphi(k), \quad k = 0, 1, \dots, m. \]

We will denote by $\Psi = \Psi_{S_*,W}$ the class of all the functions satisfying these properties.
Often, it will be convenient to extend a function $\varphi \in \Psi$ to nonnegative real numbers
by making it linear in each of the intervals [k, k+1], k = 0, 1, ..., m-1 and setting
$\varphi(u) = \varphi(m)$ for all u > m. Such an extension will be also denoted by $\varphi$. It is easy to
see that the extension is a nondecreasing function in $\mathbb{R}_+$ and the function $u \mapsto \varphi(u)/u$ is
nonincreasing.
The following coherence function will be crucial in our analysis:

\[ \bar\varphi(k) := \bar\varphi(S_*, k) := \max_{t \le k}\, t \max_{j \ge t} \frac{1}{j}\sum_{i=1}^{j} \|P_L\phi_i\|^2, \quad k = 1, \dots, m, \qquad \bar\varphi(0) = 0. \]

It is straightforward to check that $\bar\varphi \in \Psi$ and, for all $\varphi \in \Psi$, $\bar\varphi(k) \le \varphi(k)$, k = 0, ..., m.
Thus, $\bar\varphi$ is the smallest function in $\Psi$. Also, $\bar\varphi(m) = r$ since $\sum_{j=1}^{m} \|P_L\phi_j\|^2 = \|P_L\|_2^2 = r$.

Moreover, since $k \mapsto \bar\varphi(k)/k$ is nonincreasing, we have

\[ \bar\varphi(k) \ge \frac{rk}{m}, \quad k = 0, \dots, m. \]
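The coherence function can be computed in linear time from the partial sums $F_j = \sum_{i \le j}\|P_L\phi_i\|^2$ with two running maxima. The sketch below is one way to do it; the function name `coherence_function`, the random subspace and the random orthonormal basis are assumptions used only to exercise the properties stated above.

```python
import numpy as np

def coherence_function(P_L, Phi):
    """Return [phi_bar(0), ..., phi_bar(m)] for projector P_L and basis columns Phi[:, j-1] = phi_j."""
    m = Phi.shape[1]
    F = np.cumsum(np.sum((P_L @ Phi) ** 2, axis=0))         # F[j-1] = sum_{i <= j} ||P_L phi_i||^2
    t = np.arange(1, m + 1)
    tail_max = np.maximum.accumulate((F / t)[::-1])[::-1]   # max_{j >= t} F_j / j
    g = t * tail_max                                        # t * max_{j >= t} F_j / j
    return np.concatenate(([0.0], np.maximum.accumulate(g)))  # maximum over t <= k

rng = np.random.default_rng(1)
m, r = 12, 3
U, _ = np.linalg.qr(rng.standard_normal((m, r)))            # orthonormal basis of a random L
Phi, _ = np.linalg.qr(rng.standard_normal((m, m)))          # stand-in for the eigenbasis of W
phi_bar = coherence_function(U @ U.T, Phi)
k = np.arange(1, m + 1)
```

On this example one can confirm that the computed function is nondecreasing, that its ratio to k is nonincreasing, that it equals r at k = m, and that it dominates rk/m.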

m 

Given t > 0, let $t_{n,m} := t + \log\bigl(2m(\log_2(4n^{\zeta}m^{3\zeta/2}) + 2)\bigr)$. We will assume in what
follows that $m t_{n,m} \le n$ and set

\[ \varepsilon := 4\sqrt{\frac{t + \log(2m)}{nm}}. \]

Theorem 3 There exist constants $C, C_1$ depending only on c such that, for all
$s \in \{k_0 + 1, \dots, m + 1\}$ and all $\bar\varepsilon \in [\lambda_s^{-1}, \lambda_{s-1}^{-1}]$, with probability at least $1 - e^{-t}$,

\[ \|\hat S - S_*\|^2_{L_2(\Pi^2)} \le C\,\frac{\bar\varphi(S_*; s)\, m\, t_{n,m}}{n} + \bar\varepsilon\,\|W^{1/2}S_*\|^2_{L_2(\Pi^2)}
 + C_1 \max_{v \in V}\|P_L e_v\|^2 \Bigl(\frac{m t_{n,m}}{n}\Bigr)^2. \tag{2.2} \]

Remarks. Note that $\max_{v \in V}\|P_L e_v\|^2 \le 1$. Thus, the last term in the right-hand
side of bound (2.2) is smaller than the first term, provided that

mt r 



b n,m 



n 

[2 



Moreover, this term is much smaller under a low coherence condition $\max_{v \in V}\|P_L e_v\|^2 \le \frac{\nu r}{m}$
for some $\nu \ge 1$ (see conditions (1.3)). In this case,

\[ \max_{v \in V}\|P_L e_v\|^2 \Bigl(\frac{m t_{n,m}}{n}\Bigr)^2 \le \frac{\nu r m\, t_{n,m}^2}{n^2} \le \frac{\nu r\, t_{n,m}}{n}. \]

Note also that Theorem 3 holds in the case when $\bar\varepsilon = 0$. In this case, s = m and
$\bar\varphi(S_*, m) = r$, so the bound of Theorem 3 becomes

\[ \|\hat S - S_*\|^2_{L_2(\Pi^2)} \le C\,\frac{r m\, t_{n,m}}{n}, \tag{2.3} \]

which also follows from the result of Koltchinskii, Lounici and Tsybakov (2011) (see
Theorem 2 in Section 1).



Here and in what follows, we use a convention that $\lambda_{m+1} = +\infty$ and $\lambda_{m+1}^{-1} = 0$.

The function $\bar\varphi$ involved in the statement of the theorem has some connection to
the low coherence assumptions frequently used in the literature on low rank matrix
completion. To be specific, suppose that, for some $\nu \ge 1$,

\[ \sum_{j=1}^{k} \|P_L\phi_j\|^2 \le \frac{\nu r k}{m}, \quad k = 1, \dots, m. \tag{2.4} \]

Then

\[ \bar\varphi(k) \le \frac{\nu r k}{m}, \quad k = 1, \dots, m. \]

A part of standard low coherence assumptions on matrix $S_*$ with respect to the
orthonormal basis $\{\phi_k\}$ is (see (1.3))

\[ \|P_L\phi_k\|^2 \le \frac{\nu r}{m}, \quad k = 1, \dots, m, \]

and it implies condition (2.4) that can be viewed as a weak version of low coherence.
Under condition (2.4), the following corollary of Theorem 3 holds.



Corollary 1 Suppose that condition (2.4) holds. Then, there exists a constant C > 0
depending only on c such that, for all $s \in \{k_0 + 1, \dots, m + 1\}$ and all $\bar\varepsilon \in [\lambda_s^{-1}, \lambda_{s-1}^{-1}]$,
with probability at least $1 - e^{-t}$,



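Note that, if $\lambda_k \asymp k^{2\beta}$ for some $\beta > 1/2$, then the choice of s that minimizes
the bound of Corollary 1 is

\[ s \asymp \Bigl(\frac{n}{\nu r\, t_{n,m}}\Bigr)^{1/(2\beta+1)} \|W^{1/2}S_*\|^{2/(2\beta+1)}_{L_2(\Pi^2)}, \]

which, under a low coherence assumption $\max_{v \in V}\|P_L e_v\|^2 \le \frac{\nu r}{m}$, yields the bound

\[ \|\hat S - S_*\|^2_{L_2(\Pi^2)} \le C \Bigl(\frac{\nu r\, t_{n,m}}{n}\Bigr)^{2\beta/(2\beta+1)} \|W^{1/2}S_*\|^{2/(2\beta+1)}_{L_2(\Pi^2)}. \tag{2.5} \]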

The advantage of (2.5) compared with (2.3) (that holds for $\bar\varepsilon = 0$ and does not rely on
any smoothness assumption on the kernel $S_*$) is due to the fact that there is no factor
m in the numerator in the right hand side of (2.5). Due to this fact, when m is large
enough and $\nu$ is not too large, bound (2.5) becomes sharper than (2.3).
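The rate in (2.5) can be traced to a balance between two terms. The following is only a heuristic sketch, under the assumption that the leading terms of the corollary's bound are those obtained by substituting $\bar\varphi(s) \le \nu r s/m$ into (2.2) and taking $\bar\varepsilon \asymp \lambda_s^{-1} \asymp s^{-2\beta}$; the shorthand $B$ is introduced here purely for illustration.

```latex
% Heuristic balance (assumed form of the leading terms; B is shorthand used only here)
B := \|W^{1/2}S_*\|^2_{L_2(\Pi^2)}, \qquad
\frac{\nu r\, s\, t_{n,m}}{n} \asymp s^{-2\beta} B
\;\Longleftrightarrow\;
s \asymp \Bigl(\frac{n}{\nu r\, t_{n,m}}\Bigr)^{1/(2\beta+1)} B^{1/(2\beta+1)},

% Plugging this s back into either term gives the rate in (2.5):
\frac{\nu r\, s\, t_{n,m}}{n}
\asymp \Bigl(\frac{\nu r\, t_{n,m}}{n}\Bigr)^{2\beta/(2\beta+1)} B^{1/(2\beta+1)}
= \Bigl(\frac{\nu r\, t_{n,m}}{n}\Bigr)^{2\beta/(2\beta+1)} \|W^{1/2}S_*\|^{2/(2\beta+1)}_{L_2(\Pi^2)}.
```

Since $2\beta/(2\beta+1) < 1$, the dependence on n is slower than in (2.3), but the factor m is gone, which is where the gain for large m comes from.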

3 Proofs 

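Proof of Theorem 3. Bound (2.2) will be proved for an arbitrary function $\varphi \in \Psi_{S_*,W}$
with $\varphi(k) = r$, $k \ge m$, instead of $\bar\varphi$. It then can be applied to the function $\bar\varphi$ (which
is the smallest function in $\Psi_{S_*,W}$). We will also assume throughout the proof that
$s \in \{k_0, \dots, m\}$ and $\bar\varepsilon \in [\lambda_{s+1}^{-1}, \lambda_s^{-1}]$ (at the end of the proof, we replace $s + 1 \mapsto s$).

Denote $\mathcal{P}_L(A) := A - P_{L^\perp}AP_{L^\perp}$, $\mathcal{P}_L^{\perp}(A) := P_{L^\perp}AP_{L^\perp}$, $A \in \mathcal{S}_V$. Clearly, this defines
orthogonal projectors $\mathcal{P}_L, \mathcal{P}_L^{\perp}$ in the space $\mathcal{S}_V$ with Hilbert-Schmidt inner product. We
will use the following well known representation of the subdifferential of the convex function
$S \mapsto \|S\|_1$: $\partial\|S\|_1 = \{\mathrm{sign}(S) + \mathcal{P}_L^{\perp}(M) : M \in \mathcal{S}_V, \|M\| \le 1\}$, where L = supp(S)
(see Koltchinskii (2011b), Appendix A.4 and references therein). An arbitrary matrix
$A \in \partial L_n(\hat S)$ can be represented as follows:

\[ A = \frac{2}{m^2}\hat S - \frac{2}{n}\sum_{i=1}^{n} Y_i E_{X_i, X_i'} + \varepsilon\hat V + 2\varepsilon_1 W\hat S, \tag{3.1} \]

where $\hat V \in \partial\|\hat S\|_1$ and $E_{u,v} := \frac{1}{2}(e_u \otimes e_v + e_v \otimes e_u)$. Since $\hat S$ is a minimizer of
$L_n(S)$, there exists a matrix $A \in \partial L_n(\hat S)$ such that $-A$ belongs to the normal cone of B
at the point $\hat S$ (see Aubin and Ekeland (1984), Chap. 2, Corollary 6). This implies that
$\langle A, \hat S - S_*\rangle \le 0$ and, in view of (3.1),

\[ 2\langle \hat S, \hat S - S_*\rangle_{L_2(\Pi^2)} - \Bigl\langle \frac{2}{n}\sum_{i=1}^{n} Y_i E_{X_i, X_i'}, \hat S - S_* \Bigr\rangle
 + \varepsilon\langle \hat V, \hat S - S_*\rangle + 2\varepsilon_1\langle W\hat S, \hat S - S_*\rangle \le 0. \]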


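It follows by simple algebra that

\[ 2\|\hat S - S_*\|^2_{L_2(\Pi^2)} + 2\varepsilon_1\|W^{1/2}(\hat S - S_*)\|_2^2 + \varepsilon\langle \hat V, \hat S - S_*\rangle
 \le -2\varepsilon_1\langle S_*, W(\hat S - S_*)\rangle + 2\langle \Xi, \hat S - S_*\rangle, \tag{3.2} \]

where

\[ \Xi := \frac{1}{n}\sum_{j=1}^{n} \bigl(Y_j E_{X_j, X_j'} - \mathbb{E}\,Y E_{X,X'}\bigr). \]

Note that $\langle \Xi, S\rangle = \frac{1}{n}\sum_{j=1}^{n} \bigl(Y_j S(X_j, X_j') - \mathbb{E}\,Y S(X, X')\bigr)$.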

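On the other hand, let $V_* \in \partial\|S_*\|_1$. Therefore, the representation $V_* = \mathrm{sign}(S_*) +
\mathcal{P}_L^{\perp}(M)$ holds, where M is a matrix with $\|M\| \le 1$. It follows from the trace duality
property that there exists an M with $\|M\| \le 1$ such that

\[ \langle \mathcal{P}_L^{\perp}(M), \hat S - S_*\rangle = \langle M, \mathcal{P}_L^{\perp}(\hat S - S_*)\rangle = \langle M, \mathcal{P}_L^{\perp}(\hat S)\rangle = \|\mathcal{P}_L^{\perp}(\hat S)\|_1, \]

where in the first equality we used that $\mathcal{P}_L^{\perp}$ is a self-adjoint operator and in the
second equality we used that $S_*$ has support L. Using this equation and monotonicity of
subdifferentials of convex functions, we get

\[ \langle \mathrm{sign}(S_*), \hat S - S_*\rangle + \|\mathcal{P}_L^{\perp}(\hat S)\|_1 = \langle V_*, \hat S - S_*\rangle \le \langle \hat V, \hat S - S_*\rangle. \]

Substituting this in (3.2), it is easy to get

\[ 2\|\hat S - S_*\|^2_{L_2(\Pi^2)} + \varepsilon\|\mathcal{P}_L^{\perp}(\hat S)\|_1 + 2\varepsilon_1\|W^{1/2}(\hat S - S_*)\|_2^2
 \le -\varepsilon\langle \mathrm{sign}(S_*), \hat S - S_*\rangle - 2\varepsilon_1\langle W^{1/2}S_*, W^{1/2}(\hat S - S_*)\rangle + 2\langle \Xi, \hat S - S_*\rangle. \tag{3.3} \]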

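We will bound separately each term in the right hand side. First note that

\[ \varepsilon|\langle \mathrm{sign}(S_*), \hat S - S_*\rangle| \le \varepsilon\|\mathrm{sign}(S_*)\|_2\|\hat S - S_*\|_2
 = \varepsilon\sqrt{r}\, m\|\hat S - S_*\|_{L_2(\Pi^2)} \le \frac{1}{2} r m^2\varepsilon^2 + \frac{1}{2}\|\hat S - S_*\|^2_{L_2(\Pi^2)}. \tag{3.4} \]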

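We will also need a more subtle bound on $\langle \mathrm{sign}(S_*), \hat S - S_*\rangle$, expressed in terms of
function $\varphi$. Note that, for all $k_0 \le s \le m$,

\[ \langle \mathrm{sign}(S_*), \hat S - S_*\rangle = \sum_{k=1}^{m} \langle \mathrm{sign}(S_*)\phi_k, (\hat S - S_*)\phi_k\rangle
 = \sum_{k=1}^{s} \langle \mathrm{sign}(S_*)\phi_k, (\hat S - S_*)\phi_k\rangle
 + \sum_{k=s+1}^{m} \Bigl\langle \frac{\mathrm{sign}(S_*)\phi_k}{\sqrt{\lambda_k}}, \sqrt{\lambda_k}(\hat S - S_*)\phi_k \Bigr\rangle, \]

which easily implies

\[ |\langle \mathrm{sign}(S_*), \hat S - S_*\rangle|
 \le \Bigl(\sum_{k=1}^{s} \|\mathrm{sign}(S_*)\phi_k\|^2\Bigr)^{1/2} \Bigl(\sum_{k=1}^{s} \|(\hat S - S_*)\phi_k\|^2\Bigr)^{1/2}
 + \Bigl(\sum_{k=s+1}^{m} \frac{\|\mathrm{sign}(S_*)\phi_k\|^2}{\lambda_k}\Bigr)^{1/2} \Bigl(\sum_{k=s+1}^{m} \lambda_k\|(\hat S - S_*)\phi_k\|^2\Bigr)^{1/2} \]
\[ \le \Bigl(\sum_{k=1}^{s} \|P_L\phi_k\|^2\Bigr)^{1/2} \|\hat S - S_*\|_2
 + \Bigl(\sum_{k=s+1}^{m} \frac{\|P_L\phi_k\|^2}{\lambda_k}\Bigr)^{1/2} \|W^{1/2}(\hat S - S_*)\|_2. \tag{3.5} \]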

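We will now use the following elementary lemma.

Lemma 1 Let c be the constant from condition (2.1). For all $s \ge k_0 - 1$,

\[ \sum_{k=s+1}^{m} \frac{\|P_L\phi_k\|^2}{\lambda_k} \le (c + 2)\,\frac{\varphi(s+1)}{\lambda_{s+1}}. \]

Proof. Denote $F_s := \sum_{k=1}^{s} \|P_L\phi_k\|^2$, s = 1, ..., m. Then, using the properties of
function $\varphi \in \Psi$, we get

\[ \sum_{k=s+1}^{m} \frac{\|P_L\phi_k\|^2}{\lambda_k}
 = \sum_{k=s+1}^{m-1} F_k\Bigl(\frac{1}{\lambda_k} - \frac{1}{\lambda_{k+1}}\Bigr) + \frac{F_m}{\lambda_m} - \frac{F_s}{\lambda_{s+1}}
 \le \sum_{k=s+1}^{m-1} \varphi(k)\Bigl(\frac{1}{\lambda_k} - \frac{1}{\lambda_{k+1}}\Bigr) + \frac{\varphi(m)}{\lambda_m} \]
\[ \le \frac{\varphi(s+1)}{s+1}\Bigl(\sum_{k=s+1}^{m-1} k\Bigl(\frac{1}{\lambda_k} - \frac{1}{\lambda_{k+1}}\Bigr) + \frac{m}{\lambda_m}\Bigr)
 \le \frac{\varphi(s+1)}{s+1}\Bigl(\sum_{k=s+2}^{m-1} \frac{k - (k-1)}{\lambda_k} + \frac{s+1}{\lambda_{s+1}} + \frac{m}{\lambda_m}\Bigr)
 = \frac{\varphi(s+1)}{s+1}\Bigl(\sum_{k=s+2}^{m-1} \frac{1}{\lambda_k} + \frac{s+1}{\lambda_{s+1}} + \frac{m}{\lambda_m}\Bigr). \]

Using the assumptions on the spectrum of W (in particular, condition (2.1)), we conclude
that

\[ \sum_{k=s+1}^{m} \frac{\|P_L\phi_k\|^2}{\lambda_k}
 \le \frac{\varphi(s+1)}{s+1}\Bigl(c\,\frac{s+1}{\lambda_{s+1}} + \frac{s+1}{\lambda_{s+1}} + \frac{m}{\lambda_m}\Bigr)
 \le (c + 2)\,\frac{\varphi(s+1)}{\lambda_{s+1}}, \]

ending the proof.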
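The bound of Lemma 1 can be sanity-checked numerically. The sketch below is an illustration under stated assumptions, not part of the proof: it takes a diagonal W with the hypothetical polynomial spectrum $\lambda_k = k^{2\beta}$ (so that $\phi_k = e_k$ and $k_0 = 1$), computes the smallest constant c for which condition (2.1) holds for that spectrum, uses the coherence function of a random subspace as $\varphi$, and compares both sides for every s.

```python
import numpy as np

rng = np.random.default_rng(0)
m, r, beta = 30, 3, 1.0
k = np.arange(1, m + 1)
lam = k.astype(float) ** (2 * beta)                # assumed spectrum lambda_k = k^{2 beta}

# Smallest c such that sum_{k >= s} 1/lambda_k <= c * s / lambda_s for all s.
tail = np.cumsum((1.0 / lam)[::-1])[::-1]          # tail[s-1] = sum_{k >= s} 1/lambda_k
c = float(np.max(tail * lam / k))

# Random r-dimensional subspace L; with phi_k = e_k, ||P_L phi_k||^2 = (P_L)_{kk}.
U, _ = np.linalg.qr(rng.standard_normal((m, r)))
w = np.sum(U ** 2, axis=1)

# Coherence function phi_bar(1..m) built from the partial sums F_j.
F = np.cumsum(w)
tail_max = np.maximum.accumulate((F / k)[::-1])[::-1]
phi_bar = np.maximum.accumulate(k * tail_max)

# Lemma 1 for s = 0, ..., m-1: entry s compares sum_{k >= s+1} with phi_bar(s+1)/lambda_{s+1}.
lhs = np.array([np.sum(w[s:] / lam[s:]) for s in range(m)])
rhs = (c + 2.0) * phi_bar / lam
lemma_holds = bool(np.all(lhs <= rhs + 1e-12))
```

With $\beta = 1$ the sequence $s/\lambda_s = 1/s$ is nonincreasing, as the lemma's assumptions require.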



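It follows from (3.5) and the bound of Lemma 1 that

\[ |\langle \mathrm{sign}(S_*), \hat S - S_*\rangle| \le \sqrt{\varphi(s)}\,\|\hat S - S_*\|_2 + \sqrt{(c+2)\frac{\varphi(s+1)}{\lambda_{s+1}}}\,\|W^{1/2}(\hat S - S_*)\|_2 \]
\[ = m\sqrt{\varphi(s)}\,\|\hat S - S_*\|_{L_2(\Pi^2)} + m\sqrt{(c+2)\frac{\varphi(s+1)}{\lambda_{s+1}}}\,\|W^{1/2}(\hat S - S_*)\|_{L_2(\Pi^2)}. \tag{3.6} \]

This implies the following bound:

\[ \varepsilon|\langle \mathrm{sign}(S_*), \hat S - S_*\rangle| \le \varphi(s) m^2\varepsilon^2 + \frac{1}{4}\|\hat S - S_*\|^2_{L_2(\Pi^2)}
 + (c+2)\,\frac{\varphi(s+1)}{\lambda_{s+1}}\,\frac{m^2\varepsilon^2}{\bar\varepsilon} + \frac{\bar\varepsilon}{4}\|W^{1/2}(\hat S - S_*)\|^2_{L_2(\Pi^2)}, \tag{3.7} \]

where we used twice an elementary inequality $ab \le a^2 + \frac{1}{4}b^2$, $a, b \ge 0$. Since, under the
assumptions of the theorem, $\bar\varepsilon\lambda_{s+1} \ge 1$, (3.7) yields the following bound:

\[ \varepsilon|\langle \mathrm{sign}(S_*), \hat S - S_*\rangle| \le (c+3)\varphi(s+1) m^2\varepsilon^2 + \frac{1}{4}\|\hat S - S_*\|^2_{L_2(\Pi^2)} + \frac{\bar\varepsilon}{4}\|W^{1/2}(\hat S - S_*)\|^2_{L_2(\Pi^2)}. \tag{3.8} \]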

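To bound the second term in the right hand side of (3.3), note that

\[ |\langle W^{1/2}S_*, W^{1/2}(\hat S - S_*)\rangle| \le \|W^{1/2}S_*\|_2\,\|W^{1/2}(\hat S - S_*)\|_2, \tag{3.9} \]

which implies

\[ \varepsilon_1|\langle W^{1/2}S_*, W^{1/2}(\hat S - S_*)\rangle| \le \varepsilon_1\|W^{1/2}S_*\|_2^2 + \frac{\varepsilon_1}{4}\|W^{1/2}(\hat S - S_*)\|_2^2
 = \bar\varepsilon\|W^{1/2}S_*\|^2_{L_2(\Pi^2)} + \frac{\bar\varepsilon}{4}\|W^{1/2}(\hat S - S_*)\|^2_{L_2(\Pi^2)}. \tag{3.10} \]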



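Finally, we bound $\langle \Xi, \hat S - S_*\rangle$:

\[ |\langle \Xi, \hat S - S_*\rangle| \le |\langle \Xi, \mathcal{P}_L(\hat S - S_*)\rangle| + |\langle \Xi, \mathcal{P}_L^{\perp}(\hat S)\rangle|
 \le |\langle \mathcal{P}_L\Xi, \hat S - S_*\rangle| + \|\Xi\|\,\|\mathcal{P}_L^{\perp}(\hat S)\|_1. \tag{3.11} \]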



To bound $\|\Xi\|$, we use a version of the noncommutative Bernstein inequality of Ahlswede
and Winter (2002) (see also Tropp (2010), Koltchinskii (2011a, 2011b, 2011c) for other
versions of such inequalities).

Lemma 2 Let Z be a bounded random symmetric matrix with $\mathbb{E}Z = 0$, $\sigma_Z^2 := \|\mathbb{E}Z^2\|$
and $\|Z\| \le U$ for some U > 0. Let $Z_1, \dots, Z_n$ be n i.i.d. copies of Z. Then for all t > 0,
with probability at least $1 - e^{-t}$,

\[ \Bigl\|\frac{1}{n}\sum_{i=1}^{n} Z_i\Bigr\| \le 2\Bigl[\sigma_Z\sqrt{\frac{t + \log(2m)}{n}} \vee U\,\frac{t + \log(2m)}{n}\Bigr]. \]



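It is applied to i.i.d. random matrices $Z_i := Y_i E_{X_i, X_i'} - \mathbb{E}(Y_i E_{X_i, X_i'})$, i = 1, ..., n.
Since $\|Z_i\| \le 2$ and, by a simple computation, $\sigma^2_{Z_i} := \|\mathbb{E}Z_i^2\| \le 1/m$ (see, e.g.,
Koltchinskii (2011b), Section 9.4), Lemma 2 implies that with probability at least $1 - e^{-t}$

\[ \|\Xi\| = \Bigl\|\frac{1}{n}\sum_{i=1}^{n} Z_i\Bigr\| \le 2\Bigl[\sqrt{\frac{t + \log(2m)}{nm}} \vee \frac{2(t + \log(2m))}{n}\Bigr]. \]

Under the assumption that

\[ \varepsilon \ge 4\Bigl[\sqrt{\frac{t + \log(2m)}{nm}} \vee \frac{2(t + \log(2m))}{n}\Bigr], \]

this yields $\|\Xi\| \le \varepsilon/2$ and

\[ |\langle \Xi, \hat S - S_*\rangle| \le |\langle \mathcal{P}_L\Xi, \hat S - S_*\rangle| + \frac{\varepsilon}{2}\|\mathcal{P}_L^{\perp}(\hat S)\|_1. \tag{3.12} \]

For simplicity, it is assumed that $n \ge 2m(t + \log(2m))$. In this case, one can take
$\varepsilon = 4\sqrt{\frac{t + \log(2m)}{nm}}$, as it has been done in the statement of the theorem.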
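The threshold on $\varepsilon$ is a plain closed-form expression, so it is easy to tabulate. The sketch below evaluates the right-hand side of the bound on $\|\Xi\|$ and the corresponding smallest admissible $\varepsilon$; the function names `noise_level` and `epsilon` and the sample values of n, m, t are assumptions chosen for illustration.

```python
import math

def noise_level(n, m, t):
    """2 * max( sqrt(a / (n m)), 2 a / n ) with a = t + log(2m): the bound on ||Xi||."""
    a = t + math.log(2 * m)
    return 2.0 * max(math.sqrt(a / (n * m)), 2.0 * a / n)

def epsilon(n, m, t):
    """Smallest eps with eps >= 4 * max(...), which gives ||Xi|| <= eps / 2."""
    return 2.0 * noise_level(n, m, t)

n, m, t = 10_000, 10, 1.0                        # hypothetical sample size, graph size, confidence
a = t + math.log(2 * m)
eps = epsilon(n, m, t)
```

For these values n is large relative to m, so the square-root term is the larger of the two and `eps` coincides with $4\sqrt{(t + \log(2m))/(nm)}$.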
We have to bound $|\langle \mathcal{P}_L\Xi, \hat S - S_*\rangle|$ and we start with the following simple bound:

\[ |\langle \mathcal{P}_L\Xi, \hat S - S_*\rangle| \le m\|\mathcal{P}_L\Xi\|_2\,\|\hat S - S_*\|_{L_2(\Pi^2)}
 \le m\sqrt{2r}\,\|\Xi\|\,\|\hat S - S_*\|_{L_2(\Pi^2)} \]
\[ \le \frac{1}{2} m\varepsilon\sqrt{2r}\,\|\hat S - S_*\|_{L_2(\Pi^2)}
 \le \frac{1}{2} m^2\varepsilon^2 r + \frac{1}{4}\|\hat S - S_*\|^2_{L_2(\Pi^2)}, \tag{3.13} \]

where we use the fact that $\mathrm{rank}(\mathcal{P}_L\Xi) \le 2r$. Substituting (3.4), (3.10), (3.12) and (3.13)
in (3.3), we easily get that

\[ \|\hat S - S_*\|^2_{L_2(\Pi^2)} \le \frac{3}{2} r\varepsilon^2 m^2 + 2\bar\varepsilon\|W^{1/2}S_*\|^2_{L_2(\Pi^2)}. \tag{3.14} \]

For $\bar\varepsilon = 0$, this bound follows from the results of Koltchinskii, Lounici and Tsybakov
(2011). However, we need a more subtle bound expressed in terms of function $\varphi$, which
is akin to bound (3.8). To this end, we will use the following lemma.



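Lemma 3 For $\delta > 0$, let $k(\delta)$ be the largest value of $k \le m$ such that $\lambda_k^{-1} \ge \delta^2$ (if
$\lambda_1^{-1} < \delta^2$, we set $k(\delta) = 0$). For all t > 0, with probability at least $1 - e^{-t}$,

\[ \sup_{\|M\|_2 \le \delta,\ \|W^{1/2}M\|_2 \le 1} |\langle \mathcal{P}_L\Xi, M\rangle|
 \le 2\sqrt{4c + 8}\,\sqrt{\frac{t}{nm}}\,\delta\sqrt{\varphi(k(\delta) + 1)} + 2\sqrt{2}\,\delta \max_{v \in V}\|P_L e_v\|\,\frac{t}{n}, \]

provided that $k(\delta) < m$, and

\[ \sup_{\|M\|_2 \le \delta,\ \|W^{1/2}M\|_2 \le 1} |\langle \mathcal{P}_L\Xi, M\rangle|
 \le 4\sqrt{2}\,\delta\sqrt{\frac{rt}{nm}} + 2\sqrt{2}\,\delta \max_{v \in V}\|P_L e_v\|\,\frac{t}{n}, \]

provided that $k(\delta) \ge m$.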

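Proof. The proof is somewhat akin to the derivation of the bounds on Rademacher
processes in terms of Mendelson's complexities used in learning theory (see, e.g.,
Proposition 3.3 in Koltchinskii (2011b)).

Note that, for all symmetric $m \times m$ matrices M,

\[ \langle \mathcal{P}_L\Xi, M\rangle = \sum_{k,j=1}^{m} \langle \mathcal{P}_L\Xi, \phi_k \otimes \phi_j\rangle \langle M, \phi_k \otimes \phi_j\rangle. \]

Suppose that

\[ \|M\|_2^2 = \sum_{k,j=1}^{m} |\langle M, \phi_k \otimes \phi_j\rangle|^2 \le \delta^2 \]

and

\[ \|W^{1/2}M\|_2^2 = \sum_{k,j=1}^{m} \lambda_k |\langle M, \phi_k \otimes \phi_j\rangle|^2 \le 1. \]

Then, it easily follows that

\[ \sum_{k,j=1}^{m} \frac{|\langle M, \phi_k \otimes \phi_j\rangle|^2}{\lambda_k^{-1} \wedge \delta^2} \le 2, \]

which implies

\[ |\langle \mathcal{P}_L\Xi, M\rangle|
 \le \Bigl(\sum_{k,j=1}^{m} \frac{|\langle M, \phi_k \otimes \phi_j\rangle|^2}{\lambda_k^{-1} \wedge \delta^2}\Bigr)^{1/2}
 \Bigl(\sum_{k,j=1}^{m} (\lambda_k^{-1} \wedge \delta^2)\,|\langle \mathcal{P}_L\Xi, \phi_k \otimes \phi_j\rangle|^2\Bigr)^{1/2}
 \le \sqrt{2}\,\Bigl(\sum_{k,j=1}^{m} (\lambda_k^{-1} \wedge \delta^2)\,|\langle \mathcal{P}_L\Xi, \phi_k \otimes \phi_j\rangle|^2\Bigr)^{1/2}. \tag{3.15} \]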

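Define now the following inner product:

\[ \langle M_1, M_2\rangle_w := \sum_{k,j=1}^{m} (\lambda_k^{-1} \wedge \delta^2)\,\langle M_1, \phi_k \otimes \phi_j\rangle \langle M_2, \phi_k \otimes \phi_j\rangle \]

and let $\|\cdot\|_w$ be the corresponding norm. We will provide an upper bound on

\[ \|\mathcal{P}_L\Xi\|_w = \Bigl(\sum_{k,j=1}^{m} (\lambda_k^{-1} \wedge \delta^2)\,|\langle \mathcal{P}_L\Xi, \phi_k \otimes \phi_j\rangle|^2\Bigr)^{1/2}. \]

To this end, we use a standard Bernstein type inequality for random variables in a
Hilbert space. It is given in the following lemma.

Lemma 4 Let $\xi$ be a bounded random variable with values in a Hilbert space $\mathcal{H}$. Suppose
that $\mathbb{E}\xi = 0$, $\mathbb{E}\|\xi\|^2_{\mathcal{H}} = \sigma^2$ and $\|\xi\|_{\mathcal{H}} \le U$. Let $\xi_1, \dots, \xi_n$ be n i.i.d. copies of $\xi$. Then for
all t > 0, with probability at least $1 - e^{-t}$,

\[ \Bigl\|\frac{1}{n}\sum_{i=1}^{n} \xi_i\Bigr\|_{\mathcal{H}} \le 2\Bigl[\sigma\sqrt{\frac{t}{n}} \vee U\,\frac{t}{n}\Bigr]. \]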



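Applying Lemma 4 to the random variable $\xi = Y\mathcal{P}_L(E_{X,X'}) - \mathbb{E}\,Y\mathcal{P}_L(E_{X,X'})$, we
get that for all t > 0, with probability at least $1 - e^{-t}$,

\[ \|\mathcal{P}_L\Xi\|_w = \Bigl\|\frac{1}{n}\sum_{i=1}^{n} \bigl(Y_i\mathcal{P}_L(E_{X_i,X_i'}) - \mathbb{E}\,Y\mathcal{P}_L(E_{X,X'})\bigr)\Bigr\|_w
 \le 2\Bigl[\mathbb{E}^{1/2}\|Y\mathcal{P}_L(E_{X,X'})\|_w^2\,\sqrt{\frac{t}{n}} + \bigl\|\|Y\mathcal{P}_L(E_{X,X'})\|_w\bigr\|_{L_\infty}\frac{t}{n}\Bigr]. \tag{3.16} \]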
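Using the fact that $Y \in \{-1, 1\}$, we get

\[ \mathbb{E}\|Y\mathcal{P}_L(E_{X,X'})\|_w^2 = \mathbb{E}\|\mathcal{P}_L(E_{X,X'})\|_w^2
 = \sum_{k,j=1}^{m} (\lambda_k^{-1} \wedge \delta^2)\,\mathbb{E}|\langle \mathcal{P}_L(E_{X,X'}), \phi_k \otimes \phi_j\rangle|^2
 = \sum_{k,j=1}^{m} (\lambda_k^{-1} \wedge \delta^2)\,\mathbb{E}|\langle E_{X,X'}, \mathcal{P}_L(\phi_k \otimes \phi_j)\rangle|^2 \]
\[ = \sum_{k,j=1}^{m} (\lambda_k^{-1} \wedge \delta^2)\, m^{-2}\sum_{u,v \in V} |\langle E_{u,v}, \mathcal{P}_L(\phi_k \otimes \phi_j)\rangle|^2
 \le m^{-2}\sum_{k,j=1}^{m} (\lambda_k^{-1} \wedge \delta^2)\,\|\mathcal{P}_L(\phi_k \otimes \phi_j)\|_2^2 \]
\[ \le 2m^{-2}\sum_{k,j=1}^{m} (\lambda_k^{-1} \wedge \delta^2)\bigl(\|P_L\phi_k\|^2 + \|P_L\phi_j\|^2\bigr)
 = 2m^{-1}\sum_{k=1}^{m} (\lambda_k^{-1} \wedge \delta^2)\|P_L\phi_k\|^2 + 2m^{-2}\sum_{k=1}^{m} (\lambda_k^{-1} \wedge \delta^2)\sum_{j=1}^{m} \|P_L\phi_j\|^2 \]
\[ = 2m^{-1}\sum_{k=1}^{m} (\lambda_k^{-1} \wedge \delta^2)\|P_L\phi_k\|^2 + 2m^{-2}\sum_{k=1}^{m} (\lambda_k^{-1} \wedge \delta^2)\|P_L\|_2^2
 = 2m^{-1}\sum_{k=1}^{m} (\lambda_k^{-1} \wedge \delta^2)\|P_L\phi_k\|^2 + 2m^{-2} r\sum_{k=1}^{m} (\lambda_k^{-1} \wedge \delta^2). \tag{3.17} \]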

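To bound $\mathbb{E}\|Y\mathcal{P}_L(E_{X,X'})\|_w^2$ further, note that

\[ \sum_{k=1}^{m} (\lambda_k^{-1} \wedge \delta^2)\|P_L\phi_k\|^2 \le \delta^2\sum_{k \le k(\delta)} \|P_L\phi_k\|^2 + \sum_{k > k(\delta)} \lambda_k^{-1}\|P_L\phi_k\|^2. \tag{3.18} \]

Assuming that $1 \le k(\delta) \le m - 1$, using the bound of Lemma 1, the fact that
$\lambda_{k(\delta)+1}^{-1} < \delta^2$ and the monotonicity of function $\varphi$, we get from (3.18) that

\[ \sum_{k=1}^{m} (\lambda_k^{-1} \wedge \delta^2)\|P_L\phi_k\|^2 \le \delta^2\varphi(k(\delta)) + (c + 2)\,\frac{\varphi(k(\delta) + 1)}{\lambda_{k(\delta)+1}}
 \le \delta^2\varphi(k(\delta)) + (c + 2)\delta^2\varphi(k(\delta) + 1) \le (c + 3)\delta^2\varphi(k(\delta) + 1). \tag{3.19} \]

It is easy to check that (3.19) holds also for $k(\delta) = 0$ and $k(\delta) = m$ (in the last case,
$\varphi(k(\delta) + 1) = r$). We also have

\[ \sum_{k=1}^{m} (\lambda_k^{-1} \wedge \delta^2) \le \sum_{k \le k(\delta)} \delta^2 + \sum_{k > k(\delta)} \lambda_k^{-1}, \]

which, in view of condition (2.1), implies

\[ \sum_{k=1}^{m} (\lambda_k^{-1} \wedge \delta^2) \le \delta^2 k(\delta) + c\,\frac{k(\delta) + 1}{\lambda_{k(\delta)+1}} \le (c + 1)\delta^2(k(\delta) + 1). \tag{3.20} \]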
Afe(5) + 1 
Using bounds (3.17), (3.19) and (3.20), we get, under the condition that $k(\delta) < m$,

\[ \mathbb{E}\|Y\mathcal{P}_L(E_{X,X'})\|_w^2 \le 2m^{-1}(c + 3)\delta^2\varphi(k(\delta) + 1) + 2m^{-2}r(c + 1)\delta^2(k(\delta) + 1) \]
\[ \le 2m^{-1}(c + 3)\delta^2\varphi(k(\delta) + 1) + 2m^{-2}r(c + 1)\delta^2\,\frac{k(\delta) + 1}{\varphi(k(\delta) + 1)}\,\varphi(k(\delta) + 1) \]
\[ \le 2m^{-1}(c + 3)\delta^2\varphi(k(\delta) + 1) + 2m^{-2}r(c + 1)\delta^2\,\frac{m}{\varphi(m)}\,\varphi(k(\delta) + 1)
 = (4c + 8)m^{-1}\delta^2\varphi(k(\delta) + 1). \tag{3.21} \]

In the case when $k(\delta) \ge m$, it is easy to show that

\[ \mathbb{E}\|Y\mathcal{P}_L(E_{X,X'})\|_w^2 \le 4m^{-1}\delta^2 r. \tag{3.22} \]

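We can also bound $\bigl\|\|Y\mathcal{P}_L(E_{X,X'})\|_w\bigr\|_{L_\infty}$ as follows:

\[ \bigl\|\|Y\mathcal{P}_L(E_{X,X'})\|_w\bigr\|^2_{L_\infty} = \bigl\|\|\mathcal{P}_L(E_{X,X'})\|_w\bigr\|^2_{L_\infty}
 \le \max_{1 \le k \le m} (\lambda_k^{-1} \wedge \delta^2) \max_{u,v \in V} \sum_{k,j=1}^{m} |\langle \mathcal{P}_L E_{u,v}, \phi_k \otimes \phi_j\rangle|^2 \]
\[ \le \max_{1 \le k \le m} (\lambda_k^{-1} \wedge \delta^2) \max_{u,v \in V} \|\mathcal{P}_L E_{u,v}\|_2^2
 \le \delta^2 \max_{u,v \in V} \|\mathcal{P}_L(e_u \otimes e_v)\|_2^2 \le 2\delta^2 \max_{v \in V}\|P_L e_v\|^2. \tag{3.23} \]

If $k(\delta) < m$, it follows from (3.15), (3.16), (3.21) and (3.23) that with probability
at least $1 - e^{-t}$, for all symmetric matrices M with $\|M\|_2 \le \delta$ and $\|W^{1/2}M\|_2 \le 1$,

\[ |\langle \mathcal{P}_L\Xi, M\rangle| \le 2\sqrt{4c + 8}\,\sqrt{\frac{t}{nm}}\,\delta\sqrt{\varphi(k(\delta) + 1)} + 2\sqrt{2}\,\delta \max_{v \in V}\|P_L e_v\|\,\frac{t}{n}. \]

Alternatively, if $k(\delta) \ge m$, we use (3.22) to get

\[ |\langle \mathcal{P}_L\Xi, M\rangle| \le 4\delta\sqrt{\frac{rt}{nm}} + 2\sqrt{2}\,\delta \max_{v \in V}\|P_L e_v\|\,\frac{t}{n}. \]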

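It follows from Lemma 3 that, for all $\delta > 0$, the following bound holds with
probability at least $1 - e^{-t}$:

\[ \sup_{\|M\|_2 \le \delta,\ \|W^{1/2}M\|_2 \le 1} |\langle \mathcal{P}_L\Xi, M\rangle|
 \le 2\sqrt{4c + 8}\,\sqrt{\frac{t}{nm}}\,\delta\sqrt{\varphi(k(\delta) + 1)} + 2\sqrt{2}\,\delta \max_{v \in V}\|P_L e_v\|\,\frac{t}{n} \tag{3.24} \]

(recall that $\varphi(k) = r$ for $k \ge m$, so the second bound of the lemma can be included
in the first bound). Moreover, the bound can be easily made uniform in $\delta \in [\delta_-, \delta_+]$ for
arbitrary $\delta_- < \delta_+$. To this end, take $\delta_j := \delta_+ 2^{-j}$, $j = 0, 1, \dots, [\log_2(\delta_+/\delta_-)] + 1$ and use
(3.24) for each $\delta = \delta_j$ with $\bar t := t + \log([\log_2(\delta_+/\delta_-)] + 2)$ instead of t. An application of
the union bound and monotonicity of the left hand side and the right hand side of (3.24)
with respect to $\delta$ then implies that with probability at least $1 - e^{-t}$, for all $\delta \in [\delta_-, \delta_+]$,

\[ \sup_{\|M\|_2 \le \delta,\ \|W^{1/2}M\|_2 \le 1} |\langle \mathcal{P}_L\Xi, M\rangle|
 \le C\sqrt{\frac{\bar t}{nm}}\,\delta\sqrt{\varphi(k(\delta) + 1)} + 4\sqrt{2}\,\delta \max_{v \in V}\|P_L e_v\|\,\frac{\bar t}{n}, \tag{3.25} \]

where C > 0 is a constant depending only on c. Indeed, by the union bound, (3.24) holds
with probability at least

\[ 1 - ([\log_2(\delta_+/\delta_-)] + 2)e^{-\bar t} = 1 - e^{-t} \]

for all $\delta = \delta_j$, $j = 0, \dots, [\log_2(\delta_+/\delta_-)] + 1$. Therefore, for all $j = 0, \dots, [\log_2(\delta_+/\delta_-)] + 1$
and all $\delta \in (\delta_{j+1}, \delta_j]$,

\[ \sup_{\|M\|_2 \le \delta,\ \|W^{1/2}M\|_2 \le 1} |\langle \mathcal{P}_L\Xi, M\rangle|
 \le 2\sqrt{4c + 8}\,\sqrt{\frac{\bar t}{nm}}\,\delta_j\sqrt{\varphi(k(\delta_j) + 1)} + 2\sqrt{2}\,\delta_j \max_{v \in V}\|P_L e_v\|\,\frac{\bar t}{n} \tag{3.26} \]

(by monotonicity of the left hand side). Note that $k(\delta_j) \le k(\delta) \le k(\delta_{j+1})$. We can now
use the fact that $k \mapsto \varphi(k)/\lambda_k$ is a nonincreasing function and the condition $\lambda_{k+1}/\lambda_k \le c$
to show that

\[ \sqrt{\frac{\bar t}{nm}}\,\delta_j\sqrt{\varphi(k(\delta_j) + 1)} \le 2\sqrt{\frac{\bar t}{nm}}\,\delta_{j+1}\sqrt{\varphi(k(\delta_{j+1}) + 1)}
 \le 2\sqrt{\frac{\bar t}{nm}}\sqrt{\frac{\varphi(k(\delta_{j+1}) + 1)}{\lambda_{k(\delta_{j+1})}}}
 \le 2\sqrt{c}\,\sqrt{\frac{\bar t}{nm}}\sqrt{\frac{\varphi(k(\delta_{j+1}) + 1)}{\lambda_{k(\delta_{j+1})+1}}} \]
\[ \le 2\sqrt{c}\,\sqrt{\frac{\bar t}{nm}}\sqrt{\frac{\varphi(k(\delta) + 1)}{\lambda_{k(\delta)+1}}}
 \le 2\sqrt{c}\,\sqrt{\frac{\bar t}{nm}}\,\delta\sqrt{\varphi(k(\delta) + 1)}. \]

This and bound (3.26) imply that

\[ \sup_{\|M\|_2 \le \delta,\ \|W^{1/2}M\|_2 \le 1} |\langle \mathcal{P}_L\Xi, M\rangle|
 \le 4\sqrt{c(4c + 8)}\,\sqrt{\frac{\bar t}{nm}}\,\delta\sqrt{\varphi(k(\delta) + 1)} + 4\sqrt{2}\,\delta \max_{v \in V}\|P_L e_v\|\,\frac{\bar t}{n}, \tag{3.27} \]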
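which proves bound (3.25).

Set $\delta$ as

\[ \delta := \frac{\|\hat S - S_*\|_2}{\|W^{1/2}(\hat S - S_*)\|_2} = \frac{\|\hat S - S_*\|_{L_2(\Pi^2)}}{\|W^{1/2}(\hat S - S_*)\|_{L_2(\Pi^2)}} \]

and assume for now that $\delta \in [\delta_-, \delta_+]$. For a particular choice of
$M := \frac{\hat S - S_*}{\|W^{1/2}(\hat S - S_*)\|_2}$, we get from (3.25) that

\[ |\langle \mathcal{P}_L\Xi, \hat S - S_*\rangle| \le C\sqrt{\frac{\bar t}{mn}}\,\|\hat S - S_*\|_2\sqrt{\varphi(k(\delta) + 1)}
 + 4\sqrt{2}\max_{v \in V}\|P_L e_v\|\,\frac{\bar t}{n}\,\|\hat S - S_*\|_2. \tag{3.28} \]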

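Suppose now that $\delta^2 \ge \bar\varepsilon$. Since, under assumptions of the theorem, $\bar\varepsilon \in (\lambda_{s+1}^{-1}, \lambda_s^{-1}]$, this
implies that $k(\delta) \le k(\sqrt{\bar\varepsilon}) = s$ and

\[ |\langle \mathcal{P}_L\Xi, \hat S - S_*\rangle| \le C\sqrt{\frac{\bar t}{mn}}\,\|\hat S - S_*\|_2\sqrt{\varphi(s + 1)} + 4\sqrt{2}\max_{v \in V}\|P_L e_v\|\,\frac{\bar t}{n}\,\|\hat S - S_*\|_2 \]
\[ = C\sqrt{\frac{m\bar t}{n}}\,\|\hat S - S_*\|_{L_2(\Pi^2)}\sqrt{\varphi(s + 1)} + 4\sqrt{2}\max_{v \in V}\|P_L e_v\|\,\frac{m\bar t}{n}\,\|\hat S - S_*\|_{L_2(\Pi^2)} \]
\[ \le 2C^2\,\frac{\varphi(s + 1)\,m\bar t}{n} + 64\max_{v \in V}\|P_L e_v\|^2\Bigl(\frac{m\bar t}{n}\Bigr)^2 + \frac{1}{4}\|\hat S - S_*\|^2_{L_2(\Pi^2)}. \tag{3.29} \]

In the case when $\delta^2 < \bar\varepsilon$, we have $k(\delta) \ge k(\sqrt{\bar\varepsilon}) = s$. In this case, we again use the fact
that $k \mapsto \varphi(k)/\lambda_k$ is a nonincreasing function and the condition $\lambda_{k+1}/\lambda_k \le c$ to show that

\[ \sqrt{\frac{\bar t}{nm}}\,\|\hat S - S_*\|_2\sqrt{\varphi(k(\delta) + 1)} = \sqrt{\frac{m\bar t}{n}}\,\|W^{1/2}(\hat S - S_*)\|_{L_2(\Pi^2)}\sqrt{\delta^2\varphi(k(\delta) + 1)} \]
\[ \le \sqrt{\frac{m\bar t}{n}}\,\|W^{1/2}(\hat S - S_*)\|_{L_2(\Pi^2)}\sqrt{\frac{\varphi(k(\delta) + 1)}{\lambda_{k(\delta)}}}
 \le \sqrt{c}\,\sqrt{\frac{m\bar t}{n}}\,\|W^{1/2}(\hat S - S_*)\|_{L_2(\Pi^2)}\sqrt{\frac{\varphi(k(\delta) + 1)}{\lambda_{k(\delta)+1}}} \]
\[ \le \sqrt{c}\,\sqrt{\frac{m\bar t}{n}}\,\|W^{1/2}(\hat S - S_*)\|_{L_2(\Pi^2)}\sqrt{\frac{\varphi(s + 1)}{\lambda_{s+1}}}
 \le \sqrt{c}\,\sqrt{\frac{m\bar t}{n}}\,\sqrt{\bar\varepsilon}\,\|W^{1/2}(\hat S - S_*)\|_{L_2(\Pi^2)}\sqrt{\varphi(s + 1)}. \]

This allows us to deduce from (3.28) that

\[ |\langle \mathcal{P}_L\Xi, \hat S - S_*\rangle| \le C\sqrt{c}\,\sqrt{\frac{m\bar t}{n}}\,\sqrt{\bar\varepsilon}\,\|W^{1/2}(\hat S - S_*)\|_{L_2(\Pi^2)}\sqrt{\varphi(s + 1)}
 + 4\sqrt{2}\max_{v \in V}\|P_L e_v\|\,\frac{m\bar t}{n}\,\|\hat S - S_*\|_{L_2(\Pi^2)} \]
\[ \le cC^2\,\frac{\varphi(s + 1)\,m\bar t}{n} + \frac{\bar\varepsilon}{4}\|W^{1/2}(\hat S - S_*)\|^2_{L_2(\Pi^2)}
 + 32\max_{v \in V}\|P_L e_v\|^2\Bigl(\frac{m\bar t}{n}\Bigr)^2 + \frac{1}{4}\|\hat S - S_*\|^2_{L_2(\Pi^2)}. \tag{3.30} \]

It follows from bounds (3.29) and (3.30) that with probability at least $1 - e^{-t}$,

\[ |\langle \mathcal{P}_L\Xi, \hat S - S_*\rangle| \le (2 \vee c)C^2\,\frac{\varphi(s + 1)\,m\bar t}{n} + 64\max_{v \in V}\|P_L e_v\|^2\Bigl(\frac{m\bar t}{n}\Bigr)^2
 + \frac{1}{4}\|\hat S - S_*\|^2_{L_2(\Pi^2)} + \frac{\bar\varepsilon}{4}\|W^{1/2}(\hat S - S_*)\|^2_{L_2(\Pi^2)}, \tag{3.31} \]

provided that

\[ \delta = \frac{\|\hat S - S_*\|_2}{\|W^{1/2}(\hat S - S_*)\|_2} = \frac{\|\hat S - S_*\|_{L_2(\Pi^2)}}{\|W^{1/2}(\hat S - S_*)\|_{L_2(\Pi^2)}} \in [\delta_-, \delta_+]. \tag{3.32} \]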

It remains now to substitute bounds (3.8), (3.10), (3.12) and (3.31) in bound (3.3) to
get that with some constants C > 0, $C_1$ > 0 depending only on c and with probability
at least $1 - 2e^{-t}$,

\[ \|\hat S - S_*\|^2_{L_2(\Pi^2)} \le C\,\frac{\varphi(s + 1)\,m(\bar t + t_m)}{n} + \bar\varepsilon\|W^{1/2}S_*\|^2_{L_2(\Pi^2)}
 + C_1\max_{v \in V}\|P_L e_v\|^2\Bigl(\frac{m\bar t}{n}\Bigr)^2, \tag{3.33} \]

where $t_m := t + \log(2m)$.

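We still have to choose the values of $\delta_-, \delta_+$ and to handle the case when

\[ \delta = \frac{\|\hat S - S_*\|_2}{\|W^{1/2}(\hat S - S_*)\|_2} = \frac{\|\hat S - S_*\|_{L_2(\Pi^2)}}{\|W^{1/2}(\hat S - S_*)\|_{L_2(\Pi^2)}} \notin [\delta_-, \delta_+]. \tag{3.34} \]

First note that, since the largest eigenvalue of W is $\lambda_m$ and it is bounded from above by
$m^{\zeta}$, we have

\[ \|W^{1/2}(\hat S - S_*)\|_2 \le \sqrt{\lambda_m}\,\|\hat S - S_*\|_2 \le m^{\zeta/2}\|\hat S - S_*\|_2. \]

Thus, $\delta \ge m^{-\zeta/2}$. Next note that

\[ \|W^{1/2}S_*\|^2_{L_2(\Pi^2)} \le m^{-2}m^{\zeta}\|S_*\|_2^2 \le m^{\zeta}, \]

where we also took into account that the absolute values of the entries of $S_*$ are bounded
by 1. It now follows from (3.14) that, under the assumption $2mt_m/n \le 1$,

\[ \|\hat S - S_*\|^2_{L_2(\Pi^2)} \le \frac{3}{2} r m^2\varepsilon^2 + 2\bar\varepsilon m^{\zeta}
 \le 24 r m^2\,\frac{t + \log(2m)}{nm} + 2\,\frac{m^{\zeta}}{\lambda_s} \le 12m + 2m^{2\zeta} \le 14m^{2\zeta}, \]

which holds with probability at least $1 - e^{-t}$. Therefore, as soon as
$\|W^{1/2}(\hat S - S_*)\|_{L_2(\Pi^2)} \ge n^{-\zeta}$, we have $\delta \le 4n^{\zeta}m^{\zeta}$.

We will now take $\delta_- := m^{-\zeta/2}$, $\delta_+ := 4n^{\zeta}m^{\zeta}$. Then, the only case when (3.34) can
possibly hold is if $\|W^{1/2}(\hat S - S_*)\|_{L_2(\Pi^2)} < n^{-\zeta}$. In this case, we can set

\[ \delta := n^{\zeta}\|\hat S - S_*\|_{L_2(\Pi^2)} \in [\delta_-, \delta_+] \]

and follow the proof of bound (3.31) replacing throughout the argument
$\|W^{1/2}(\hat S - S_*)\|_{L_2(\Pi^2)}$ with $n^{-\zeta}$. This yields

\[ |\langle \mathcal{P}_L\Xi, \hat S - S_*\rangle| \le (2 \vee c)C^2\,\frac{\varphi(s + 1)\,m\bar t}{n} + 64\max_{v \in V}\|P_L e_v\|^2\Bigl(\frac{m\bar t}{n}\Bigr)^2
 + \frac{1}{4}\|\hat S - S_*\|^2_{L_2(\Pi^2)} + \frac{1}{4}\bar\varepsilon n^{-2\zeta}. \tag{3.35} \]

Bound (3.35) can be now used instead of (3.31) to prove that

\[ \|\hat S - S_*\|^2_{L_2(\Pi^2)} \le C\,\frac{\varphi(s + 1)\,m(\bar t + t_m)}{n} + \bar\varepsilon\|W^{1/2}S_*\|^2_{L_2(\Pi^2)}
 + C_1\max_{v \in V}\|P_L e_v\|^2\Bigl(\frac{m\bar t}{n}\Bigr)^2 + \bar\varepsilon n^{-2\zeta} \tag{3.36} \]

with some constants $C, C_1 > 0$ depending only on c.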

Clearly, we can assume that $C_1 \ge 1$ and $\bar t \ge 1$. Since $m \le n^2$ (recall that we even
assumed that $m t_{n,m}/n \le 1$), $\zeta \ge 1$, $\max_{v \in V}\|P_L e_v\|^2 \ge 1/m$ and $\bar\varepsilon \le \lambda_s^{-1} \le m^{\zeta}$, it is easy to
check that

\[ C_1\max_{v \in V}\|P_L e_v\|^2\Bigl(\frac{m\bar t}{n}\Bigr)^2 \ge \frac{m}{n^2} \ge \frac{m^{\zeta}}{n^{2\zeta}} \ge \bar\varepsilon n^{-2\zeta}. \]

Thus, the last term of bound (3.36) can be dropped (with a proper adjustment of constant
$C_1$).

Note also that with our choice of $\delta_-, \delta_+$,

\[ \bar t = t + \log([\log_2(\delta_+/\delta_-)] + 2) \le t + \log\bigl(\log_2(4n^{\zeta}m^{3\zeta/2}) + 2\bigr) \]

and $\bar t + t_m \le 2t_{n,m}$. It is now easy to conclude that, with some constants $C, C_1$ depending
only on c and with probability at least $1 - 3e^{-t}$,

\[ \|\hat S - S_*\|^2_{L_2(\Pi^2)} \le C\,\frac{\varphi(s + 1)\,m\,t_{n,m}}{n} + \bar\varepsilon\|W^{1/2}S_*\|^2_{L_2(\Pi^2)}
 + C_1\max_{v \in V}\|P_L e_v\|^2\Bigl(\frac{m\,t_{n,m}}{n}\Bigr)^2. \tag{3.37} \]

The probability bound $1 - 3e^{-t}$ can be rewritten as $1 - e^{-t}$ by changing the value of
constants $C, C_1$. Also, by changing the notation $s + 1 \mapsto s$, bound (3.37) yields (2.2).
This completes the proof of the theorem.

□